Evaluation is an important part of the system development cycle; it also helps to improve new
machine translation (MT) technology by comparing it with established systems in order to identify
the weaknesses and the strengths of the proposed MT system. This work evaluates the performance and
effectiveness of the English-Arabic DIA translator in the domain of the sulfur industry (DSI). The
evaluation compares this programme with the prominent Google translator on a set of 1,200 English
sentences using the bilingual evaluation understudy (BLEU) method. The results show that the
efficiency of the Google translator is about 30.325%, while the efficiency of the DIA translator in
the sulfur industry domain is about 73.325%; DIA is therefore more effective and gives better
translation accuracy. The efficiency of the BLEU method is about 90.478% compared with the human
expert evaluator.
1. INTRODUCTION

The evaluation of machine translation (MT) systems is an important field of research for optimizing
the effectiveness of technologies in the MT improvement cycle [1]. Evaluation means estimating or
examining the validity of a particular thing. Whenever a novel technology is under development, it
requires repeated testing or assessment on a defined basis. The requirement for evaluating MT arises
in the same way [2]. Given the great development in the MT field, as well as the prominence of the
requirement for high speed and a very high degree of accuracy in exchanging information between two
or more languages [3], human evaluation is time-consuming and expensive, and thus unsuitable for
repeated use during the research or development of MT engines [4].

Evaluating an MT system is as important as the MT system itself: it addresses whether linguistic
items are rendered precisely, fluently, and in an acceptable way [5], and thereby validates an MT
algorithm. Over the last decades, a large number of metrics have been used for evaluating MT quality,
based on a variety of similar criteria that are intended to be language independent rather than
targeted at a particular natural language. Most of them rely on comparing the automatic translation
with a reference translation [6].
The accuracy of any MT system is normally estimated by comparing its output with the judgements of
expert humans [7]. The present study uses a performance-based method. Bilingual evaluation understudy
(BLEU), introduced by Papineni et al. [2], is a method for evaluating MT systems that is intended to
be automatic, language independent, and closely correlated with human assessment. BLEU is built on
one essential notion for determining the quality of a particular MT programme: the closeness of the
MT output to a reference translation of the same text produced by an experienced human
translator [8].

The closeness of the candidate translation to the reference translation is determined by a modified
n-gram precision with n = {1, 2, 3, 4} [9]. The modified n-gram precision is the essential criterion
that BLEU applies to distinguish good candidate translations from weak ones [10]. It counts the words
that occur in both the candidate translation and the reference translation, and then divides that
count by the total number of words in the candidate translation [11]. Within BLEU, candidate
sentences that are shorter than their reference counterparts are penalized [12]; in addition, the
modified n-gram precision penalizes candidate sentences that over-generate correct word forms.
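
The clipped n-gram precision described above can be illustrated with a minimal Python sketch. The function names and the toy English sentences are assumptions made for illustration only; they are not taken from the software or the corpus used in this study.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate against several references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # Highest count of each n-gram observed in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    # Clip each candidate count so that repeated words cannot inflate the score.
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


# Hypothetical toy data: a candidate that over-generates a correct word.
candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]
print(modified_precision(candidate, references, 1))  # 2/7, about 0.286
```

Without clipping, every word of this candidate would count as a match and the unigram precision would be 1; clipping at the maximum reference count of 2 reduces it to 2/7.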

English-to-Arabic MT has been a challenging and exciting research subject for a large number of
researchers in the field of standard Arabic language processing. Many attempts have been made to
build or improve MT from Arabic into many other languages [13]. This research concentrates on
assessing the performance of the English-Arabic DIA MT software against the output of Google
Translate. The purpose of the present research is to estimate the performance of the DIA programme
in comparison with that of Google Translate on a variety of text types translated from English into
Arabic, as well as the acceptability and usability of the output for end-users. Adequacy
scores [7], [14] and fluency scores are the main tests used to assess the quality of the
translation [15].


2. METHOD 

The present research adopts the BLEU method [16], [17] for evaluating the English-Arabic DIA
translator in the domain of the sulfur industry (DSI) and the Google translator. Automatic evaluation
only supplies a way of comparing the output texts with human references; it does not measure the
quality of the translation in absolute terms. Arabic allows a variety of word forms and word orders,
so it can express any idea in several ways. Moreover, the many existing dialects, and the fact that
the range of possible expressions is not necessarily the same in the two languages involved, make it
likely that a single sentence can carry many meanings, as noted by Alqudsi et al. [18].

In this study, the measurement of intelligibility is centred on a pair of characteristics, fluency
and adequacy, using the BLEU score formula. The score is the product of the brevity penalty (BP) and
the geometric mean of the modified n-gram precisions. Therefore, we begin by calculating the
geometric mean of the modified n-gram precisions. After that, the length of the candidate text (c)
and the effective reference corpus length (r) are calculated in order to compute the BP. Then the
closest human judgment score is determined. Equation (1) [2] shows how the BP decays exponentially
in r/c:

$$\mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - r/c)} & \text{if } c \le r \end{cases} \tag{1}$$

Equation (2) shows how the final BLEU score is computed:

$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right) \tag{2}$$

where N equals 4, p_n is the modified n-gram precision, and the uniform weights w_n equal 1/N [7].
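
As a rough illustration of equations (1) and (2), the following Python sketch computes a sentence-level BLEU score with N = 4 and uniform weights. It is a simplified sketch, not the ASP.NET software used in this study: it assumes whitespace-tokenised input, applies no smoothing, and scores a single sentence rather than a whole corpus. All function names are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the contiguous n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def brevity_penalty(c, r):
    """Equation (1): 1 if c > r, otherwise exp(1 - r/c)."""
    if c == 0:
        return 0.0
    return 1.0 if c > r else math.exp(1 - r / c)


def bleu(candidate, references, max_n=4):
    """Equation (2) with uniform weights w_n = 1/N, for one sentence."""
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        best_ref = Counter()
        for ref in references:
            best_ref |= ngram_counts(ref, n)       # element-wise maximum
        clipped = sum((cand & best_ref).values())  # element-wise minimum
        total = sum(cand.values())
        if total == 0 or clipped == 0:
            return 0.0  # log(0) is undefined; unsmoothed BLEU is 0 here
        log_sum += math.log(clipped / total) / max_n
    c = len(candidate)
    # Effective reference length: the reference length closest to c.
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    return brevity_penalty(c, r) * math.exp(log_sum)


# Hypothetical toy sentences, not data from the study.
reference = "the cat is on the mat".split()
print(bleu(reference, [reference]))                # identical text: 1.0
print(bleu("the cat is on".split(), [reference]))  # exp(-0.5), about 0.607
```

In the second call every n-gram of the candidate appears in the reference, so all four precisions equal 1 and the score is determined entirely by the brevity penalty exp(1 - 6/4).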

BLEU scores range from 0 to 1 [2]; a value of 1 means that the candidate text fully matches the
reference, and a value of 0 means that the candidate text and the respective reference text are
totally distinct. Because the sentence is the fundamental unit in the translation process of the two
programs assessed, it was selected as the fundamental test element. Consequently, this research
assessed only the output quality of sentences; the focus was on the preservation of meaning, which
involves comparing the meaning of the output with that of the original [19].

The data are pre-processed by splitting every translation into n-grams of different sizes: unigrams,
bigrams, trigrams, and tetragrams. The precision of the DIA translator and the Google translator is
calculated for each of the four n-gram sizes, and a unified precision score is then computed across
the four n-gram sizes. These scores are then compared to decide which system achieves the highest
score [20] (when comparing MT schemes, individual systems and system components are rated on the
basis of how often they are judged to be better than or equal to any other scheme). The following
algorithm is applied to evaluate the translation; its main steps are: i) start; ii) input (source
text); iii) input (two reference target texts); iv) translate the source text with the DIA
translator; v) translate the source text with the Google translator; vi) automatically evaluate the
quality of the DIA translator; vii) automatically evaluate the quality of the Google translator;
viii) compare the output quality of DIA and Google; ix) compare the resulting quality with the human
expert evaluator; x) print (rank MT systems from best to worst); and xi) end.
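
The scoring and ranking steps of the algorithm above can be sketched as follows. This is a hedged illustration under several assumptions: the system outputs are already available as strings, clipped unigram precision stands in for the full BLEU score, and the function names, system names, and English toy sentences are hypothetical placeholders rather than material from the study.

```python
from collections import Counter


def unigram_precision(candidate, references):
    """Stand-in scorer (clipped unigram precision); a full BLEU score would replace it."""
    cand = Counter(candidate.split())
    best_ref = Counter()
    for ref in references:
        best_ref |= Counter(ref.split())  # element-wise maximum over references
    total = sum(cand.values())
    return sum((cand & best_ref).values()) / total if total else 0.0


def rank_systems(system_outputs, references, scorer=unigram_precision):
    """Score each system's sentences against the references and rank the systems.

    system_outputs: {system name: [output sentence, ...]}
    references:     [[reference 1, reference 2], ...], one list per source sentence
    """
    averages = {}
    for name, outputs in system_outputs.items():
        scores = [scorer(out, refs) for out, refs in zip(outputs, references)]
        averages[name] = sum(scores) / len(scores) if scores else 0.0
    # Best system first, as in step x) of the algorithm.
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Toy placeholders with two references per source sentence.
references = [["sulfur is a yellow solid", "sulphur is a yellow solid"]]
system_outputs = {
    "system_a": ["sulfur is a yellow solid"],
    "system_b": ["sulfur be yellow"],
}
for name, average in rank_systems(system_outputs, references):
    print(name, round(average, 3))
```

Passing the scorer as a parameter keeps the ranking logic independent of the metric, so the same loop could be reused for the human scores in step ix).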

BLEU is quite a rough measure of translation performance [21]. Figure 1 illustrates the main steps
of the method and the way n-grams are extracted from the English source, the Arabic output, and the
Arabic reference sentences in order to calculate BLEU scores for the DIA and Google MT systems. After
that, the nearest human assessment can be judged. Several significant factors contribute to the
coarseness of BLEU [22]: i) synonyms and paraphrases are only credited if they appear in the set of
multiple references [23]; ii) all words are weighted equally, so there is no extra penalty for
missing content-bearing words [24]; and iii) the brevity penalty is a stop-gap measure to compensate
for the fairly serious problem of not being able to calculate recall [25]. Each of these shortcomings
increases the number of translations that the metric cannot properly tell apart. Since BLEU can in
theory assign equal scores to translations of manifestly different quality [26], it is logical that
a higher BLEU score does not guarantee a better translation.
Figure 1. The structure of the main method steps


3. RESULTS AND DISCUSSION 
We created software for the automatic evaluation of Arabic MT quality (BLEU method) using ASP.NET
2017. The quality evaluation of MT is illustrated by Tables 1 and 2. Table 1 gives the scale for
fluency: is the output fluent in the target language? This involves both grammatical correctness and
idiomatic word choices. Table 2 gives the scale for adequacy: does the output convey the same meaning
as the input sentence? Is part of the message lost, added, or distorted? Figure 2 describes the main
screen of the MT evaluation system. Most proposed approaches for the English-Arabic DIA translator
have been tested on a limited domain, the sulfur industry. Therefore, to evaluate the outcomes of
this evaluation scheme, we selected a corpus of 1,200 phrases categorised under four criteria (terms,
phrases, limited-domain text, and general text), which were then rendered into Arabic by each of the
DIA and Google translators.

From the results obtained with the BLEU method for the DIA programme and Google Translate, we
reached the following conclusions:

- The analysis of chemical symbols shows that the Google database does not include the symbols used
in the sulfur industry field, in contrast to the DIA system, which holds integrated information on
the input symbols because it is a specialized translation system for this field.

- The terminology test, in which items are usually no longer than three words, also shows that the
Google database includes only a few of the terms, whereas the DIA system was able to translate them.

- Testing specialized sulfur industry expressions shows that the DIA system is considerably better
than Google.

- Testing texts of no more than 50 words shows that DIA is better at rendering synonyms, while both
systems (DIA and Google) were equal with respect to the grammatical order of sentence constituents.

- Testing general texts of no more than 50 words, outside the field of the sulfur industry, shows
that Google translates better because its database is richer than that of DIA.

- In general, Google Translate was observed to be inferior to the DIA system in most tests, as
indicated in Table 3.

- Finally, comparing the human evaluation with the BLEU results for the translations of both systems
(DIA and Google) shows that the BLEU method reached 89.875% adequacy.

In terms of the average accuracy over the sentences of the test corpus, the DIA programme achieved
greater translation accuracy than Google Translate: 73.325% for DIA versus 30.325% for Google under
the BLEU method in the performed tests. The results for the MT systems are shown in Figure 3 and
Table 3.
Figure 2. Block diagram of BLEU evaluation


Table 1. Scales for assigned fluency scores

Scale   Fluency
0.2     Incomprehensible
0.4     Non-fluent
0.6     Non-native
0.8     Good
1       Flawless


Table 2. Scales for assigned adequacy scores

Scale   Adequacy
0.2     None
0.4     Little
0.6     Much
0.8     Most
1       All


Figure 4 shows a summary of the average precision. The x-axis covers sulfur industry terms, phrases
of fewer than six words, and sentences of fewer than 50 words; the y-axis gives the quality score.
Figure 3. Results of MT systems evaluation


Table 3. Average precision for each type

Translator   Method        Terms   Phrase   Text   Average precision   Percentage (%)
DIA MT       BLEU meth.    0.81    0.75     0.64   0.733               73.325
DIA MT       Human tran.   0.85    0.80     0.73   0.793               79.793
Google       BLEU meth.    0.19    0.34     0.38   0.303               30.325
Google       Human tran.   0.24    0.39     0.41   0.347               34.675
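
As a quick arithmetic check, the "Average precision" column of Table 3 appears to be the unweighted mean of the Terms, Phrase, and Text columns. The short Python sketch below recomputes it from the values printed in the table; the dictionary layout and labels are illustrative assumptions. The percentage column is not recomputed, because the text does not state how it was derived.

```python
# Category scores copied from Table 3, in the order Terms, Phrase, Text.
table3 = {
    ("DIA", "BLEU"): [0.81, 0.75, 0.64],
    ("DIA", "Human"): [0.85, 0.80, 0.73],
    ("Google", "BLEU"): [0.19, 0.34, 0.38],
    ("Google", "Human"): [0.24, 0.39, 0.41],
}

# Unweighted mean over the three categories for each translator and method.
averages = {key: sum(values) / len(values) for key, values in table3.items()}

for (system, method), average in averages.items():
    print(f"{system:6s} {method:5s} {average:.3f}")
```

Rounded to three decimals, the means are 0.733, 0.793, 0.303, and 0.347, which match the "Average precision" column.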
Figure 4. Summary of average precision


4. CONCLUSION 

The present research concludes that automatic evaluation with the BLEU technique of the DIA system
in the sulfur industry domain comes close to human assessment. Furthermore, experiments were
conducted on the effectiveness of two online MT systems (i.e., the Google translator and the DIA
translator) in translating 1,200 English sulfur symbols, terms, and texts within the scope of the
sulfur industry into Arabic. Most of the techniques applied to assess the accuracy of MT output
automatically rely on comparing the candidate text with the reference text. The findings show that
the average accuracy of the DIA translator is about 73.325%, compared with about 30.325% for the
Google MT system, when the BLEU technique is used. The efficiency of the BLEU method is about
90.478% compared with the human expert evaluator.